A Corpus-Based Approach to Language Learning

Authors

  • Benjamin Franklin
  • Eric Brill
  • Mark Liberman
Abstract

Eric Brill. Supervisor: Mitchell Marcus.

One goal of computational linguistics is to discover a method for assigning a rich structural annotation to sentences that are presented as simple linear strings of words; meaning can be much more readily extracted from a structurally annotated sentence than from a sentence with no structural information. Also, structure allows for a more in-depth check of the well-formedness of a sentence. There are two phases to assigning these structural annotations: first, a knowledge base is created, and second, an algorithm is used to generate a structural annotation for a sentence based upon the facts provided in the knowledge base. Until recently, most knowledge bases were created manually by language experts. These knowledge bases are expensive to create and have not been used effectively in structurally parsing sentences from other than highly restricted domains. The goal of this dissertation is to make significant progress toward designing automata that are able to learn some structural aspects of human language with little human guidance. In particular, we describe a learning algorithm that takes a small structurally annotated corpus of text and a larger unannotated corpus as input, and automatically learns how to assign accurate structural descriptions to sentences not in the training corpus. The main tool we use to automatically discover structural information about language from corpora is transformation-based error-driven learning. The distribution of errors produced by an imperfect annotator is examined to learn an ordered list of transformations that can be applied to provide an accurate structural annotation. We demonstrate the application of this learning algorithm...
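The error-driven loop described in the abstract can be illustrated with a toy part-of-speech example. The sketch below is not the dissertation's implementation. It assumes a single hypothetical rule template ("change tag A to B when the previous tag is C"), a made-up lexicon as the imperfect initial annotator, and Penn-style tag names chosen only for illustration. It also uses only an annotated corpus, whereas the abstract mentions a larger unannotated corpus as a second input.

```python
from collections import Counter

def initial_tags(words, lexicon, default="NN"):
    # Imperfect initial annotator (assumption): one fixed tag per word.
    return [lexicon.get(w, default) for w in words]

def apply_rule(tags, rule):
    # Rule = (from_tag, to_tag, prev_tag): retag from_tag as to_tag
    # wherever the preceding tag is prev_tag. Contexts are read from
    # the input tags, so all changes are made simultaneously.
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out

def learn(words, gold, lexicon, max_rules=10):
    # Greedily learn an ordered list of transformations from the
    # errors of the current annotation against the gold tags.
    current = initial_tags(words, lexicon)
    rules = []
    for _ in range(max_rules):
        scores = Counter()
        # Each current error proposes the rule that would fix it.
        for i in range(1, len(current)):
            if current[i] != gold[i]:
                scores[(current[i], gold[i], current[i - 1])] += 1
        # Penalise each candidate for correct tags it would break.
        for rule in list(scores):
            from_tag, _, prev_tag = rule
            for i in range(1, len(current)):
                if (current[i] == from_tag and current[i - 1] == prev_tag
                        and gold[i] == from_tag):
                    scores[rule] -= 1
        if not scores:
            break
        best, gain = max(scores.items(), key=lambda kv: kv[1])
        if gain <= 0:
            break  # no candidate reduces the error count
        current = apply_rule(current, best)
        rules.append(best)
    return rules

def tag(words, lexicon, rules):
    # Annotate new text: initial guess, then the rules in learned order.
    tags = initial_tags(words, lexicon)
    for rule in rules:
        tags = apply_rule(tags, rule)
    return tags
```

Because each rule is scored by the errors it fixes minus the correct tags it would break, the learned list is ordered by greedy net error reduction, and later rules are applied to the output of earlier ones.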

Similar resources

Brill’s rule-based PoS tagger

Eric Brill introduced a PoS tagger in 1992 that was based on rules, or transformations as he calls them, where the grammar is induced directly from the training corpus without human intervention or expert knowledge. The only additional component necessary is a small, manually and correctly annotated corpus (the training corpus), which serves as input to the tagger. The system is then able to deriv...

Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging

Recently, there has been a rebirth of empiricism in the field of natural language processing. Manual encoding of linguistic information is being challenged by automated corpus-based learning as a method of providing a natural language processing system with linguistic knowledge. Although corpus-based approaches have been successful in many different areas of natural language processing, it is o...

Exploring the Statistical Derivation of Transformational Rule Sequences for Part-of-Speech Tagging

Eric Brill in his recent thesis (1993b) proposed an approach called "transformation-based error-driven learning" that can statistically derive linguistic models from corpora, and he has applied the approach in various domains including part-of-speech tagging (Brill, 1992; Brill, 1994) and building phrase structure trees (Brill, 1993a). The method learns a sequence of symbolic rules that charact...

Concordance-Based Data-Driven Learning Activities and Learning English Phrasal Verbs in EFL Classrooms

In spite of the highly beneficial applications of corpus linguistics in language pedagogy, it has not found its way into mainstream EFL. The major reasons seem to be the teachers’ lack of training and the unavailability of resources, especially computers in language classes. Phrasal verbs have been shown to be a problematic area of learning English as a foreign language due to their semantic op...

Publication date: 1993